OK, maybe one more iteration. This is a bit cleaner than my previous 12P solution, with an extra cycle and some area saved. I really feel like 8P should be possible, but I am not seeing an efficient way to get the input out of the way to make it work.